Project Report

GGAP

Author

Alyssa Shou, Grace Chang, Grace Shao, and Paisley Lucier

Published

December 4, 2023

Abstract
This report analyzes Chicago crime data from 2022. First, we explore how crime characteristics (crime type, location, and violence level) vary by hour, to address people’s concerns and stereotypes regarding peak crime hours. The findings lead us to recommend that stakeholders, meaning anyone in Chicago, remain vigilant throughout the day, as public crime rates during daylight hours are very similar to those at night. Next, theft is examined across the government-delineated community areas of Chicago. The region composed of The Loop, Near North Side, Near West Side, and West Town exhibits the highest levels of theft, so stakeholders are urged to pay particular attention to their belongings in these areas. Broadening our perspective, we then analyze crime distributions across community areas with respect to CTA stations. CTA riders are advised to be especially cautious at the Roosevelt, Jackson, and 95th/Dan Ryan stations, which we found to be the most dangerous, with high rates of theft and physical crimes. Finally, the proportion of crimes resulting in arrest is assessed by police district, and the police are encouraged to allocate more resources to several districts that show a large disparity in arrest proportion for the same crime type.

1 Background / Motivation

Chicago is often deemed one of the most dangerous cities in the United States, and people are typically concerned with this issue when visiting or living in the city. As the four of us are students at Northwestern, in close proximity to Chicago, we frequent the city ourselves and have personally witnessed crimes being committed. For example, on the Red Line, a common mode of transportation for many Chicago residents and tourists, there are many visible crimes that scare people away from using the subway. Therefore, we were interested in analyzing crimes in Chicago in order to see which types of crimes we should be more vigilant about, and where in Chicago we should practice extra caution. These analyses are crucial to our safety and the safety of others like us, who share similar experiences. Additionally, police inequality or inefficient allocation of resources in policing has been a prevalent issue in the United States recently, so we wanted to explore potential areas of disparity.

2 Problem statement

  1. Alyssa’s question looks at how types of crime are distributed throughout the day. Crime characteristics are expected to vary across the hours of the day. We wanted to know which types of crimes are most common in the daytime vs. the nighttime, where crimes are commonly committed, and whether or not they are violent.

  2. Grace Chang’s question examines theft, the most common type of crime, accounting for around 23% of total crimes, in further detail. We were interested in how theft was distributed across the various community areas of Chicago (e.g. The Loop), and how the population density of these areas relates to their respective theft rates.

  3. Grace Shao’s question explores the most dangerous stations and community areas to ride the CTA in. We wanted to know the typical profiles of CTA crimes, specifically, where they are committed, whether they are committed on the train or platform, and which time of day to avoid certain stations.

  4. Paisley’s question examines associations between the proportion of crimes that resulted in arrest and the location within the city of Chicago, particularly with regard to police districts. How does the arrest proportion fluctuate depending on what part of the city the crime occurs in?

3 Data sources

3.0.1 Primary Dataset:

To conduct our analysis, our primary dataset was Chicago crime data for 2022 as reported by the city of Chicago: https://data.cityofchicago.org/Public-Safety/Crimes-2022/9hwr-2zxp

The data reports 239,043 observations of crimes in the city of Chicago (at the time of download, as this data is being updated to this day), and information about the crime such as its location, time/date, and whether or not the crime resulted in arrest.

3.0.2 Supporting datasets include:

  • Chicago community areas by numeric code, population, area, and population density: https://en.wikipedia.org/wiki/Community_areas_in_Chicago
    • Since the original dataset includes the numeric code of the community areas, to make our analysis more usable and readable, we merged the two datasets to include community area names.
  • IUCR codes https://data.cityofchicago.org/widgets/c7ck-438e
    • Used in Alyssa’s analysis solely as reference to find violent crime type IUCR codes but not actually merged with data.
  • CTA stations coordinates: https://data.cityofchicago.org/Transportation/CTA-System-Information-List-of-L-Stops/8pix-ypme
    • The latitude and longitude columns are used to find the nearest subway station to each crime.
  • Police sentiment data via the city of Chicago: https://data.cityofchicago.org/Public-Safety/Police-Sentiment-Scores/28me-84fj/data
    • This dataset is a compilation of collected survey data about residents’ feeling towards police based on their responses to 4 questions: 1) rating the safety of their neighborhood, 2) rating how they feel the police in their neighborhood listen to concerns of local residents, 3) rating how well the police in their neighborhood treat local residents with respect, and 4) rating trust in their police. (Note: responses were scored on a scale of 1-10, and the data is compiled to multiply scores by 10 so an average rating of 60 in the dataset corresponds to an average score of 6)

4 Stakeholders

Our primary purpose is to help stakeholders understand crime in the city of Chicago. This understanding helps general parties make better choices to promote public and personal safety.

  • Chicago residents/visitors: Residents and visitors will benefit from our analysis by using our recommendations to more safely navigate the city and transit stations and make housing decisions.

  • Police force: For the police force, we hope our analysis can give them direction on how to better serve and satisfy communities across districts and determine where to focus resources to create a safer Chicago.

5 Data quality check / cleaning / preparation

In a tabular form, show the distribution of values of each variable used in the analysis - for both categorical and continuous variables. Distribution of a categorical variable must include the number of missing values, the number of unique values, the frequency of all its levels. If a categorical variable has too many levels, you may just include the counts of the top 3-5 levels.

Were there any potentially incorrect values of variables that required cleaning? If yes, how did you clean them?

Did your analysis require any other kind of data preparation before it was ready to use?

Below is a table of all of the continuous variables we used in our analysis, including variables from supporting datasets.

       Year      Latitude       Longitude      Hour           Population Density  Safety Score  Trust Score  Respect Score  Listening Score
count  239043.0  234936.000000  234936.000000  239043.000000  54845.000000        1164.000000   1164.000000  1164.000000    1164.000000
mean   2022.0    41.845596      -87.668599     12.317633      6984.499435         57.442328     57.462148    58.680120      56.244296
std    0.0       0.088833       0.061009       6.985090       3602.179700         5.389730      6.707499     7.096675       6.634690
min    2022.0    36.619446      -91.686566     0.000000       388.360000          33.980000     38.040000    39.310000      35.040000
25%    2022.0    41.769150      -87.710149     7.000000       4405.730000         54.137500     52.427500    53.540000      51.525000
50%    2022.0    41.862981      -87.661469     13.000000      6226.000000         57.470000     57.410000    58.420000      56.360000
75%    2022.0    41.909017      -87.626402     18.000000      9516.370000         61.380000     61.890000    63.667500      60.620000
max    2022.0    42.022548      -87.524532     23.000000      14863.580000        71.110000     77.100000    78.850000      75.630000

The following are the categorical variables included in the main dataset:

  • IUCR
  • Primary Type
  • Description
  • Location Description
  • Arrest
  • District
  • Community Area
  • Latitude
  • Longitude

6 Data quality check / cleaning / preparation

   Name of Variable                  Unique Values  Missing Values  Most Common Value            Second Most Common              Third Most Common
0  IUCR                              304            0               Level: 0810, Count: 20096    Level: 0820, Count: 18863       Level: 0486, Count: 18679
1  Description                       284            0               Level: SIMPLE, Count: 27207  Level: OVER $500, Count: 20096  Level: $500 AND UNDER, Count: 18863
2  Location Description              134            881             Level: STREET, Count: 67630  Level: APARTMENT, Count: 45596  Level: RESIDENCE, Count: 30470
3  Arrest                            2              0               Level: False, Count: 211218  Level: True, Count: 27825
4  District                          23             0               Level: 8, Count: 14811       Level: 6, Count: 14709          Level: 12, Count: 14353
5  Community Area                    77             0               Level: 25, Count: 12251      Level: 8, Count: 10608          Level: 28, Count: 9496
6  Community Name (From theft data)  77             0               Level: 0, Count: 4251        Level: 1, Count: 3464           Level: 2, Count: 3373

6.0.1 Individual Data Preparation & Cleaning

Alyssa: text

Grace Chang: I first subset the data so that only the observations with the Primary Type listed as “Theft” remained. Most missing values were located in the latitude and longitude columns, which I did not use, so I kept every observation for a more thorough analysis and did not drop any missing values. I merged this dataset with one containing the Community Area names, so that I could pair the numeric Community Area codes in the raw dataset with their popularly known names.
To accomplish this, I had to clean the raw theft data so that the numeric codes were formatted in the same fashion as in the community area dataset. After producing the final merged dataset, I also cleaned the observations so that the formatting of the community names was consistent.
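The code-matching step described above can be sketched in plain Python. This is only an illustration under stated assumptions: the sample rows, the float-versus-string mismatch, and the raw name `(The) Loop[11]` are hypothetical stand-ins for whatever formatting differences the real files contained, and the helper names `normalize_code` and `clean_name` are invented for this sketch.

```python
import re

# Hypothetical sample rows: assume the crime data stores area codes as floats
# while the lookup table stores them as strings.
thefts = [
    {"Primary Type": "THEFT", "Community Area": 32.0},
    {"Primary Type": "THEFT", "Community Area": 8.0},
    {"Primary Type": "THEFT", "Community Area": 32.0},
]
# Hypothetical raw lookup values, including a footnote marker and parentheses.
area_names = {"32": "(The) Loop[11]", "8": "Near North Side"}


def normalize_code(code):
    """Convert 32.0, '32', or ' 32 ' to the integer 32."""
    return int(float(str(code).strip()))


def clean_name(name):
    """Make scraped community names consistent."""
    name = re.sub(r"\[\d+\]", "", name)  # drop footnote markers such as [11]
    name = re.sub(r"[()]", "", name)     # drop stray parentheses
    return " ".join(name.split())        # collapse extra whitespace


# Build a lookup keyed by the normalized code, then attach names to each row.
lookup = {normalize_code(k): clean_name(v) for k, v in area_names.items()}
merged = [
    {**row, "Community Name": lookup.get(normalize_code(row["Community Area"]))}
    for row in thefts
]
```

Normalizing both sides to the same integer key before joining avoids silent mismatches such as `32.0` versus `"32"`, which would otherwise leave community names empty after the merge.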

Grace Shao: I began by subsetting the data to only include crimes occurring in a CTA station or on a CTA train. While I did not have to change any of the values in the dataset, I did remove the observations with NA values in the longitude or latitude, because I needed coordinates to map each observation later in the analysis. Only around 1% of my data had NA values, so the removal did not have significant repercussions for my analysis. Additionally, I added the names of the community areas in order to enhance readability and make my graphs easier to build. The original data only included the numeric code of the community area for each observation, so I merged my data with a Wikipedia table that matched numeric codes with community area names.

Paisley: For my analysis concerning arrests and police districts, I did not consider the 14 observations in district 31, as this district is split with area in both the North and South side and had so few observations (less than 1% of the data). Additionally, I added a new column for my analysis that binned the data by side, referencing this source: https://news.wttw.com/sites/default/files/Map%20of%20Chicago%20Police%20Districts%20and%20Beats.pdf. For much of my analysis I only considered the top 12 crimes to ensure that one district with very few observations did not skew the data. Lastly, for the police sentiment data that I worked with, I kept only the survey scores that were recorded in 2022 to match the crime dataset. Since the scores for safety, trust, respect, and listening were all very highly correlated with one another (correlation coefficients all > 0.9), I aggregated these scores by taking their mean in each district for one of my analyses.
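As a rough illustration of the sentiment aggregation described above, the sketch below averages the four survey scores within each row and then averages those row-level values per district. The rows, the column names, and the choice to average per row first are assumptions made for the example, not details taken from the actual dataset.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical 2022 survey rows; column names are illustrative only.
rows = [
    {"district": 1, "safety": 60.0, "trust": 58.0, "respect": 62.0, "listen": 56.0},
    {"district": 1, "safety": 62.0, "trust": 60.0, "respect": 64.0, "listen": 58.0},
    {"district": 7, "safety": 50.0, "trust": 48.0, "respect": 52.0, "listen": 46.0},
]
SCORE_COLUMNS = ("safety", "trust", "respect", "listen")

# Collapse the four highly correlated scores into one value per survey row.
per_district = defaultdict(list)
for row in rows:
    per_district[row["district"]].append(mean(row[c] for c in SCORE_COLUMNS))

# Average the row-level values to get one sentiment score per district.
avg_sentiment = {district: mean(values) for district, values in per_district.items()}
```

Because the four scores move together so closely, a single averaged score loses little information while giving one number per district to compare against arrest proportions.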

7 Exploratory Data Analysis

7.1 Analysis 1

By Alyssa Shou

For my question on how crime type varies throughout the day, I started by graphing a full distribution of the number of crimes per hour. Based on this line graph, I saw that there are peaks at 12 am and 12 pm, so I used these two hours as parts of the dataset to specifically analyze.
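A minimal sketch of the hourly count behind such a line graph is shown below. It assumes the timestamps are strings in the form `MM/DD/YYYY HH:MM:SS AM/PM`; the four sample timestamps are made up for illustration.

```python
from collections import Counter
from datetime import datetime

# Hypothetical timestamps in the assumed 12-hour format.
dates = [
    "01/01/2022 12:00:00 AM",
    "01/01/2022 12:30:00 PM",
    "03/15/2022 12:05:00 AM",
    "07/04/2022 05:45:00 PM",
]
DATE_FORMAT = "%m/%d/%Y %I:%M:%S %p"

# Extract the hour of day (0-23) from each timestamp and count occurrences.
hours = [datetime.strptime(d, DATE_FORMAT).hour for d in dates]
crimes_per_hour = Counter(hours)

# A 24-element list is convenient for plotting, including hours with no crimes.
hourly_counts = [crimes_per_hour.get(h, 0) for h in range(24)]

# The busiest hour; in the full data, midnight and noon stood out.
busiest_hour = crimes_per_hour.most_common(1)[0][0]
```

One caveat worth keeping in mind when reading peaks at exactly 12 AM and 12 PM: reports with an unknown time are sometimes recorded at a default hour, so such peaks may partly reflect data entry rather than behavior. That is a general data-quality consideration, not something the dataset documentation is being quoted on here.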

I was also interested in analyzing rush hour time frames. Morning rush hour is ….

7.2 Analysis 2

By Grace Chang

Since I am conducting research on theft, I subset the data so that only data with the Primary Type “Theft” remained; therefore, I could perform my subsequent analyses on only the theft data. I firstly wanted to see how the types of theft varied across the seventy-seven government-delineated Chicago community areas. In order to perform this analysis, I looked for the top twelve community areas with the highest number of observations of theft crimes, and subset the data such that only the data for these twelve areas remained. I focused on only these twelve observations because I wanted to visualize for stakeholders which areas they should pay particular attention to in terms of how often theft crimes are observed there, and how types of theft crimes compare across those twelve areas.
In order to visualize these statistics, I decided on using a stacked bar-plot. Originally, I tried to use a series of line plots—one plot for each community area in the top twelve, featuring the number of thefts of a given type of theft during each month of the year—via the Seaborn FacetGrid method. This did not work as there was a drastic difference between the number of occurrences in certain categories of theft for some community areas during most of the months. As a result, the scale of the plots within the FacetGrid, while they were consistent, did not match up well with the scale of the lines plotted, and the visualization was difficult to view and interpret. The stacked bar-plot removed the month factor, but I realized by attempting the FacetGrid that the month did not matter much. The stacked bar-plot makes it easy to see the proportions of the various theft types and visually compares the frequencies for the different community areas.
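The table that feeds a stacked bar plot of this kind can be sketched as follows. The records are hypothetical, and the cutoff of two areas stands in for the top twelve used in the analysis.

```python
from collections import Counter, defaultdict

# Hypothetical (community area, theft description) pairs.
thefts = [
    ("The Loop", "OVER $500"),
    ("The Loop", "RETAIL THEFT"),
    ("The Loop", "OVER $500"),
    ("West Town", "$500 AND UNDER"),
    ("West Town", "OVER $500"),
    ("Uptown", "RETAIL THEFT"),
]
TOP_N = 2  # the report used the top twelve areas

# Rank community areas by total theft count and keep the top N.
area_totals = Counter(area for area, _ in thefts)
top_areas = [area for area, _ in area_totals.most_common(TOP_N)]

# Count theft types within each retained area; each inner Counter
# holds the segment heights of one stacked bar.
stacked = defaultdict(Counter)
for area, description in thefts:
    if area in top_areas:
        stacked[area][description] += 1
```

Each key of `stacked` corresponds to one bar and each inner count to one colored segment, which is why this layout makes both totals and type proportions visible at once.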

[Figure: stacked bar plot titled “Types of theft in each community area”]

Based on this plot, I noticed that theft over 500 dollars exceeds theft under 500 dollars in most of the community areas, with a notable gap in the ‘West Town’ and ‘Near West Side’ areas, showing that theft of greater financial value is more common than theft of lesser value. This is further supported by each type’s share of overall theft: theft over 500 dollars accounts for 33%, and theft under 500 dollars for around 28%. While this is not a large difference, I concluded that financial theft in general is by far the most common type of theft. The plot also exhibits this trend: theft of monetary assets accounts for the greatest proportion of thefts, and retail theft is also common. One surprising observation from this plot is that pick-pocketing only makes up a small part of thefts across the twelve community areas compared to retail theft and financial theft.
Finally, I observed that the four areas with the most theft occurrences were concentrated in the same area—The Loop, Near North Side, Near West Side, and West Town border each other, and this region is also often described as downtown Chicago by visitors and residents. This observation implies that people should be particularly wary in these areas, especially seeing as they are popular areas to live and visit. Based on this context, I further questioned whether there was a relationship between the population density of a community area and the number of thefts in that area. In order to attack this problem, I created a dataset via merging that included the community area names, their corresponding numbers of thefts, and their population densities. I then utilized this dataset to plot a linear regression relating population density and number of thefts.

[Figure: linear regression of number of thefts on population density; y-axis: “Number of Thefts”]

Community Areas Ranked by Population Density
     Community Name  Density (sqkm)
0   Near North Side        14863.58
4         Lake View        12752.44
15        Edgewater        12491.89
9       Rogers Park        11672.81
1          The Loop         9897.73
38      Albany Park         9732.13
10           Uptown         9516.37
6      Lincoln Park         8612.96
12       West Ridge         8435.36
55          Hermosa         7940.46
20   Belmont Cragin         7713.71
7      Logan Square         7707.48

It is important to see that while there is a general positive correlation, only two of the four community areas with the highest numbers of theft fall within the top twelve most densely populated areas. I compared the top four with the top twelve because the top four community areas of interest were originally extracted from the twelve community areas with the most theft overall. This observation implies that many other variables besides population density likely affect the number of thefts, although the overall trend suggests that population density is still a relevant predictor. The Near North Side (rank 1 in population density) and The Loop (rank 5 in population density) are much denser than West Town and the Near West Side and border these two areas on the east, so a potential explanation is that many people travel between these areas, bringing a spill-over of theft crimes into the two less densely populated areas.
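To make the regression reasoning concrete, here is a small least-squares sketch in plain Python. The density and theft values are invented so that the arithmetic is easy to follow; they are not the report’s data, and `fit_line` is a helper written for this example rather than the plotting routine the analysis used.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical community areas: population density (per sq km) and theft counts.
densities = [2000.0, 4000.0, 6000.0, 8000.0]
theft_counts = [500.0, 900.0, 1300.0, 1700.0]
slope, intercept = fit_line(densities, theft_counts)

# Residual for a hypothetical low-density, high-theft area.
# A large positive residual means far more theft than density alone predicts.
area_density, area_thefts = 3000.0, 2500.0
predicted = slope * area_density + intercept
residual = area_thefts - predicted
```

Looking at residuals in this way is one option for flagging areas, such as the less dense ones discussed above, where factors other than population density appear to drive theft.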

7.3 Analysis 3

By Grace Shao

I began by investigating the location of CTA crimes. For Chicagoans to know which locations to avoid, it is important to know which community area has the most crime. I found the 10 community areas with the highest number of crimes, and graphed them below in descending order. The Loop has the most crime by far: it had more than 4x as much crime as the community area with the second-highest count.

This graph matches the map of CTA crimes as shown below. I wanted to create this map to visualize where crimes were happening and highlight that The Loop had a very high number of crimes, shown with the high density of points in that area. With this question, I did not anticipate that I would have to make many changes to make the map more readable. I changed the color scale to correspond with the community area, decreased opacity, and increased the zoom to focus on The Loop. This process took a lot of trial and error, especially since plotly was a new library that I had never used before.

Since The Loop represented such a large portion of the crimes committed and is an extremely popular area of Chicago (The Bean, Art Institute, and River Walk are all located there), I wanted to do further analysis on it. Subsetting the data to include only The Loop, I found the most common crimes occurring on CTA stations within that community area. For the top 6 crime types, I found that 3 were theft related and 3 were more physical and violent. Simple battery was the most common crime overall, while pickpocketing was the second most common crime.

I thought it was interesting that theft-related crimes were much more likely to happen on the train. Physical crimes, by contrast, happened on the platform in a significantly higher proportion of cases than theft did.

Now that I had established The Loop as the most dangerous community area, I wanted to also pinpoint the most dangerous stations. During my data analysis, I ran into a problem. The dataset did not include which station the crime was committed at – only the longitude and latitude. I had anticipated that the crimes might be clustered and easy to identify on a map, but I found that in areas with many stations it was very difficult to visually identify which station the crime belonged to. I decided instead to import a list of the CTA stations and their coordinates.

For each crime, I found the closest station by longitude and latitude using the Haversine formula, which is recommended for coordinate calculations because it computes distance along the surface of a sphere\(^{1}\). I then found and created a graph of the top 3 most dangerous stations.

  1. “Haversine Formula to Find Distance between Two Points on a Sphere.” GeeksforGeeks, GeeksforGeeks, 5 Sept. 2022, www.geeksforgeeks.org/haversine-formula-to-find-distance-between-two-points-on-a-sphere/.
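The nearest-station assignment can be sketched as below. The Haversine function follows the standard formula; the Earth radius of 6371 km is the usual mean value, and the two station coordinates are approximate, illustrative values rather than entries copied from the CTA stops file.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# Approximate, illustrative station coordinates (assumed for this sketch).
stations = {
    "Roosevelt": (41.8674, -87.6274),
    "Jackson": (41.8782, -87.6276),
}


def nearest_station(lat, lon, station_coords):
    """Return the name of the station closest to the given point."""
    return min(
        station_coords,
        key=lambda name: haversine_km(lat, lon, *station_coords[name]),
    )


# A hypothetical crime location a short distance from Roosevelt.
assigned = nearest_station(41.8680, -87.6270, stations)
```

Comparing every crime against every station is quadratic in the worst case, but with a few thousand CTA crimes and a few hundred stops that brute-force approach remains cheap.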

The map below shows the crimes mapped for each of the top 3 most dangerous stations. Two of these stations are in The Loop, and the other is farther south. All three are Red Line stations.

While knowing the top 3 most dangerous stations is important, I also wanted to know the time of day where most crimes occur. This gives people more information in case they are traveling through these stations. I found that crimes happen most just after midnight around 2 AM and during rush hour. This trend held true for all 3 stations.

Lastly, I wanted to explore just how much more dangerous Roosevelt, the most crime ridden station, was than the average station. This would be important for my recommendations to stakeholders to illustrate why safety is important around The Loop.

The Roosevelt station is approximately 7.87 times more dangerous than the average station.
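One way such a ratio could be computed is sketched below, with made-up counts. It assumes that “average station” means the mean crime count over stations that had at least one assigned crime; the report does not state which definition was used, and including stations with zero crimes would lower the average and raise the ratio.

```python
from collections import Counter

# Hypothetical nearest-station assignments, one entry per crime.
station_of_crime = (
    ["Roosevelt"] * 6 + ["Jackson"] * 3 + ["Howard"] * 2 + ["Belmont"] * 1
)

crimes_per_station = Counter(station_of_crime)

# Mean count over stations that appear in the data (an assumption; see above).
average_per_station = sum(crimes_per_station.values()) / len(crimes_per_station)

# How many times the average station's count Roosevelt accounts for.
ratio = crimes_per_station["Roosevelt"] / average_per_station
```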

7.4 Analysis 4

By Paisley Lucier

The analysis question that I explored is: What are the associations between the proportion of committed crimes that resulted in arrest and the police district of the occurrence? In order to answer this question, I attempted many different approaches. I tried grouping the data in many ways to see if there were associations between variables. Namely, I struggled with binning the data and effectively aggregating it. As ‘Arrest’ is a boolean value, I had to find representative and effective ways to bin the data, before landing on binning police districts by side of Chicago. Ultimately, within my analysis I still maintained nuance with regard to recommendations for specific districts. First, I looked at the overall proportion of observations that resulted in arrest for each side of Chicago.

This shows the overall proportion of crimes resulting in arrest for each side. As we can see, the North side has the highest overall arrest proportion, higher than both the South and Central sides. However, it would be more enlightening to see the arrest proportion by type of crime: this is what the graph on the right shows.

The barplot on the right above shows the proportion of observations resulting in arrest for each of the 12 crimes with the most overall observations in the data, separated by the side the crime occurred in. Within this bar chart, we can see that the disparities in arrest proportion across sides persist, though they are smaller in magnitude. This graph shows us that of the top 12 most frequently recorded crimes, the North Side’s arrest rate is higher for 10 of them. Additionally, we can see that some crimes have much higher arrest proportions across sides than others: narcotics and weapons violations have higher arrest proportions than other crimes. Criminal trespass appears to have the biggest disparity between the side with the highest arrest proportion and the side with the lowest, which I will further explore later.
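Because ‘Arrest’ is boolean, an arrest proportion is simply the number of arrests divided by the number of observations in a group. The sketch below computes it both per side and per side-and-crime-type pair; the records are hypothetical and the side labels are placeholders for the binning described earlier.

```python
from collections import defaultdict

# Hypothetical (side, primary type, arrested) records.
records = [
    ("North", "NARCOTICS", True),
    ("North", "NARCOTICS", True),
    ("North", "THEFT", False),
    ("North", "THEFT", True),
    ("South", "NARCOTICS", True),
    ("South", "NARCOTICS", False),
    ("South", "THEFT", False),
    ("South", "THEFT", False),
]

side_totals, side_arrests = defaultdict(int), defaultdict(int)
group_totals, group_arrests = defaultdict(int), defaultdict(int)

for side, primary_type, arrested in records:
    side_totals[side] += 1
    side_arrests[side] += int(arrested)  # True counts as 1, False as 0
    group_totals[(side, primary_type)] += 1
    group_arrests[(side, primary_type)] += int(arrested)

# Overall arrest proportion per side.
side_proportion = {s: side_arrests[s] / side_totals[s] for s in side_totals}

# Arrest proportion per (side, crime type) pair.
group_proportion = {g: group_arrests[g] / group_totals[g] for g in group_totals}
```

Splitting by crime type matters because sides with a different mix of crimes can show different overall proportions even when their per-crime proportions are similar.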

The North side has a higher average sentiment score for each question. Much like many of the bar collections by side in the visualization of arrest proportion across side when separated by primary type seen above, the North side is highest, then the Central side follows, and then the lowest is the South side. We can see that there is little variation in score across the different survey questions for each of the sides, although the North side appears to have the largest lead on the Central side in its Safety score.

I next considered the association between arrest proportion and average sentiment score for each district for the top 12 crimes. The 5 crimes of ROBBERY (0.658005), BATTERY (0.598168), THEFT (0.495089), ASSAULT (0.459407), and BURGLARY (0.442316) had the highest correlation between a district’s average sentiment score and its arrest rate (with the next highest having a correlation coefficient of 0.26). I binned these five crimes by general type and visualized them below.
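The correlation coefficients quoted above are presumably Pearson correlations; under that assumption, the calculation for one crime type can be sketched as follows. The district-level values are invented for illustration and `pearson` is a helper written for this example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical per-district values for a single crime type:
# average sentiment score and proportion of that crime resulting in arrest.
avg_sentiment = [50.0, 55.0, 60.0, 65.0]
arrest_proportion = [0.05, 0.07, 0.06, 0.10]

r = pearson(avg_sentiment, arrest_proportion)
```

Repeating this once per crime type, with one point per district, yields a list of coefficients that can then be ranked as in the paragraph above. With only about two dozen districts, each coefficient rests on few points, so moderate values should be read cautiously.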

As we can see, for the crime types of assault and battery, as well as the crime types of burglary, robbery, and theft, there is a moderate, positive, and linear association between the proportion arrested and average sentiment for a district. Lastly, I wanted to take a look at which crime-district combinations had the highest disparity in arrest proportion. Below is a dataframe displaying the districts with the maximum and minimum arrest proportions, and the difference between them, for the top 12 crimes with the most occurrences.

Primary Type         District of Max Arrest Proportion  District of Min Arrest Proportion  Difference
WEAPONS VIOLATION    18.0                               17.0                               0.516396
CRIMINAL TRESPASS    16.0                               22.0                               0.449882
OTHER OFFENSE        14.0                               3.0                                0.272351
ROBBERY              16.0                               5.0                                0.134063
NARCOTICS            19.0                               6.0                                0.116667
ASSAULT              1.0                                12.0                               0.085086
BURGLARY             18.0                               4.0                                0.083151
BATTERY              1.0                                7.0                                0.078671
THEFT                1.0                                7.0                                0.072987
CRIMINAL DAMAGE      16.0                               2.0                                0.036573
DECEPTIVE PRACTICE   22.0                               25.0                               0.028950
MOTOR VEHICLE THEFT  20.0                               9.0                                0.028729

From the dataframe above (sorted by the difference in arrest proportion), we can see that, of the top 12 crimes, weapons violation and criminal trespass have the highest arrest disparity, followed by other offense, robbery, and narcotics. These top 5 all have a difference of over 10 percentage points.
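A disparity table of this shape can be derived from per-district arrest proportions roughly as follows. The proportions below are hypothetical and chosen only to show the mechanics.

```python
# Hypothetical arrest proportions keyed by (primary type, district).
arrest_proportion = {
    ("ROBBERY", 16): 0.20,
    ("ROBBERY", 5): 0.07,
    ("ROBBERY", 1): 0.10,
    ("THEFT", 1): 0.12,
    ("THEFT", 7): 0.05,
}

# Regroup as {primary type: {district: proportion}}.
by_type = {}
for (primary_type, district), proportion in arrest_proportion.items():
    by_type.setdefault(primary_type, {})[district] = proportion

# For each crime type, find the districts with the highest and lowest
# arrest proportion and the gap between them.
disparity = []
for primary_type, props in by_type.items():
    district_max = max(props, key=props.get)
    district_min = min(props, key=props.get)
    difference = props[district_max] - props[district_min]
    disparity.append((primary_type, district_max, district_min, difference))

# Sort so the crime types with the largest gap come first.
disparity.sort(key=lambda row: row[3], reverse=True)
```

The difference is a gap in proportions, so a value of 0.13 means the two districts are 13 percentage points apart. Districts with very few observations of a crime can produce extreme proportions, which is one reason to restrict the table to the most frequent crime types.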

8 Conclusions & Recommendations

Our individual analyses address the broader topic of how to promote personal and community safety and welfare within Chicago. This ties into people’s satisfaction with policing and how to improve these sentiments, along with suggestions on how people should look out for themselves when traveling or living in the city. Across our analyses of theft, general crime, and CTA crime, rush hour and midnight emerge as the most dangerous times. Additionally, theft is very common across Chicago, whether on the street, in residential homes, or in transportation areas, so stakeholders should be vigilant about their possessions, and can feel less anxious about murder, for example, which makes up only 0.3% of total crimes.

Alyssa’s recommendation
text

Grace Chang’s recommendation
Next, based on the analysis of theft crimes, we recommend that stakeholders (anyone who frequents or resides in Chicago) pay more attention to their personal belongings in the region consisting of The Loop, Near North Side, Near West Side, and West Town. This region is popular for travel, as it includes financial districts and tourist attractions such as the Magnificent Mile, the Bean, and more, so many stakeholders are affected by this result. Furthermore, seeing that 33% of all thefts are thefts of financial assets over 500 dollars, and 28% are thefts of under 500 dollars, it is essential to be attentive to one’s financial possessions. Meanwhile, pick-pocketing, for example, represents only a small percentage (5.16%) of total theft crimes, so stakeholders can be assured that this crime is less common, contrary to the common assumption that pick-pocketing is a heavy concern when it comes to theft.
There are a few limitations that stakeholders should keep in mind: This analysis does not include motor vehicle theft, another common type of theft, because motor vehicle theft has its own subsets of theft types that clash with the general theft category or overwhelm it, such that it became difficult to perform deeper analysis on the general theft category. Additionally, within these community areas there are neighborhoods that can vary in crime rates, but these go beyond the scope of our research and dataset, so stakeholders should do further analysis on a particular neighborhood they are visiting.

Grace Shao’s recommendation
For the CTA, it is clear that the Loop has the highest amount of crime by far. The Loop has more than 4x as much crime as the next most dangerous community area. Since the Loop is a popular tourist area, with landmarks such as the river walk, Art Institute, and Cloud Gate, many students and stakeholders may be traveling there. It is important that people stay alert when in The Loop, especially in the Roosevelt and Jackson stations. Since theft is much more likely to occur on the train than the platform, watch your belongings closely on the train. Compared to theft, physical or violent crimes have a higher chance of happening on the platform. Avoid making contact with others on the platform and leave space between you and others.

As for specific stations, avoid Roosevelt, 95th/Dan Ryan, and Jackson when possible, especially around midnight and 6-7 pm, when the crime rate peaks. Chances of crime at Roosevelt are about 7.87x higher than at the average station. By following these recommendations, stakeholders can stay safe while traveling in the city.

Paisley’s recommendation
With regard to the police stakeholders, police should allocate resources to, and conduct more research into, demographic information and district needs in order to pinpoint the roots of the disparities in arrest rates across districts for the same type of crime: namely weapons violations in districts 17 and 18, criminal trespass in districts 22 and 16, robbery in districts 5 and 16, and narcotics in districts 6 and 19, which all have arrest proportion disparities of >10% across the named districts.
In particular, as seen in the associations between a district’s arrest proportion and its sentiment rating, robbery has the highest correlation and is also in the top 5 crimes with the highest arrest disparity. Thus, police should allocate resources to prevention of robbery in district 5, as well as further consider their arrest tactics and get community input to aim for higher sentiment scores. (Note: District 5’s lowest robbery arrest proportion is followed by district , and district ), so this recommendation extends to these districts.

Appendix

  • Alyssa & Grace S. address the previous year’s findings
  • Alyssa’s rush hour analysis (perhaps say in the report that the grader should check the appendix for further information)

Previous year’s CTA findings: While the previous year did not analyze all of the stations, they did single out Howard station as one of interest, given that students used it frequently. They found that crime on Howard peaked just after midnight and around rush hour, which matches with the analysis I did on the top 3 most dangerous stations. In those stations, the most dangerous times were also around 1-2 AM and rush hour.